Floating Point Fault Tolerance with Backward Error Assertions

Authors

  • Daniel L. Boley
  • Gene H. Golub
  • Samy Makar
  • Nirmal R. Saxena
  • Edward J. McCluskey
Abstract

This paper introduces an assertion scheme, based on backward error analysis, for detecting errors in algorithms that solve dense systems of linear equations, A x = b. Unlike previous methods, this Backward Error Assertion Model is designed specifically for floating point arithmetic subject to round-off errors, and it can easily be instrumented in a Watchdog processor environment. Verifying the assertions costs O(n^2) operations, compared with the O(n^3) cost of the algorithms that solve A x = b. Unlike other proposed error detection methods, this assertion model does not require any encoding of the matrix A. Experimental results under various error models are presented to validate the effectiveness of the scheme.
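
A minimal sketch of the kind of check the abstract describes. It assumes the standard normwise backward error of Rigal and Gaches, eta(x) = ||b - A x|| / (||A|| ||x|| + ||b||) in the infinity norm, and a hypothetical tolerance c*n*eps with c = 10. The paper's exact assertion and thresholds may differ. The O(n^3) solver is plain Gaussian elimination with partial pivoting. The assertion needs only one matrix-vector product, so it costs O(n^2), and it works directly on A, x and b with no checksum encoding of A. The names `solve`, `backward_error`, `backward_error_ok` and `flip_bit` are illustrative, and the single bit flip is just one hypothetical fault model.

```python
import struct
import sys
from typing import List

Matrix = List[List[float]]
Vector = List[float]

# Machine epsilon for IEEE double precision (2**-52).
EPS = sys.float_info.epsilon


def solve(A: Matrix, b: Vector) -> Vector:
    """Solve A x = b by Gaussian elimination with partial pivoting (O(n^3))."""
    n = len(A)
    # Augmented copy [A | b] so the caller's data is not modified.
    M = [list(map(float, row)) + [float(bi)] for row, bi in zip(A, b)]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(M[i][k]))  # pivot row
        M[k], M[p] = M[p], M[k]
        for i in range(k + 1, n):
            f = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= f * M[k][j]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / M[i][i]
    return x


def backward_error(A: Matrix, x: Vector, b: Vector) -> float:
    """Normwise backward error ||b - A x|| / (||A|| ||x|| + ||b||), infinity norm, O(n^2)."""
    r = [bi - sum(aij * xj for aij, xj in zip(row, x)) for row, bi in zip(A, b)]
    norm_r = max(abs(t) for t in r)
    norm_a = max(sum(abs(t) for t in row) for row in A)
    denom = norm_a * max(abs(t) for t in x) + max(abs(t) for t in b)
    return norm_r / denom if denom > 0.0 else norm_r


def backward_error_ok(A: Matrix, x: Vector, b: Vector, c: float = 10.0) -> bool:
    """Hypothetical assertion: accept x only if its backward error is at most c*n*EPS."""
    return backward_error(A, x, b) <= c * len(A) * EPS


def flip_bit(v: float, bit: int) -> float:
    """Illustrative fault model: flip one bit of a double's IEEE 754 encoding."""
    (u,) = struct.unpack("<Q", struct.pack("<d", v))
    (w,) = struct.unpack("<d", struct.pack("<Q", u ^ (1 << bit)))
    return w


if __name__ == "__main__":
    A = [[4.0, 1.0, 0.0, 0.0],
         [1.0, 4.0, 1.0, 0.0],
         [0.0, 1.0, 4.0, 1.0],
         [0.0, 0.0, 1.0, 4.0]]
    b = [1.0, 2.0, 3.0, 4.0]
    x = solve(A, b)
    print("clean result passes:", backward_error_ok(A, x, b))
    x_bad = x[:]
    x_bad[1] = flip_bit(x_bad[1], 40)  # corrupt one mantissa bit of the result
    print("corrupted result passes:", backward_error_ok(A, x_bad, b))
```

With this well-conditioned example, the clean solution should pass and the bit-flipped one should fail. The constant c trades false alarms caused by ordinary round-off against sensitivity to small faults.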

Journal:
  • IEEE Trans. Computers

Volume 44

Publication date: 1995